91 research outputs found
On the Feasibility and Limitations of Just-in-Time Instruction Set Extension for FPGA-Based Reconfigurable Processors
Reconfigurable instruction set processors provide the possibility of tailor the instruction set of a CPU to a particular application. While this customization process could be performed during runtime in order to adapt the CPU to the currently executed workload, this use case has been hardly investigated. In this paper, we study the feasibility of moving the customization process to runtime and evaluate the relation of the expected speedups and the associated overheads. To this end, we present a tool flow that is tailored to the requirements of this just-in-time ASIP specialization scenario. We evaluate our methods by targeting our previously introduced Woolcano reconfigurable ASIP architecture for a set of applications from the SPEC2006, SPEC2000, MiBench, and SciMark2 benchmark suites. Our results show that just-in-time ASIP specialization is promising for embedded computing applications, where average speedups of 5x can be achieved by spending 50 minutes for custom instruction identification and hardware generation. These overheads will be compensated if the applications execute for more than 2 hours. For the scientific computing benchmarks, the achievable speedup is only 1.2x, which requires significant execution times in the order of days to amortize the overheads
A Massively Parallel Algorithm for the Approximate Calculation of Inverse p-th Roots of Large Sparse Matrices
We present the submatrix method, a highly parallelizable method for the
approximate calculation of inverse p-th roots of large sparse symmetric
matrices which are required in different scientific applications. We follow the
idea of Approximate Computing, allowing imprecision in the final result in
order to be able to utilize the sparsity of the input matrix and to allow
massively parallel execution. For an n x n matrix, the proposed algorithm
allows to distribute the calculations over n nodes with only little
communication overhead. The approximate result matrix exhibits the same
sparsity pattern as the input matrix, allowing for efficient reuse of allocated
data structures.
We evaluate the algorithm with respect to the error that it introduces into
calculated results, as well as its performance and scalability. We demonstrate
that the error is relatively limited for well-conditioned matrices and that
results are still valuable for error-resilient applications like
preconditioning even for ill-conditioned matrices. We discuss the execution
time and scaling of the algorithm on a theoretical level and present a
distributed implementation of the algorithm using MPI and OpenMP. We
demonstrate the scalability of this implementation by running it on a
high-performance compute cluster comprised of 1024 CPU cores, showing a speedup
of 665x compared to single-threaded execution
FPGA Acceleration of Communication-Bound Streaming Applications: Architecture Modeling and a 3D Image Compositing Case Study
Reconfigurable computers usually provide a limited
number of different memory resources, such as host
memory, external memory, and on-chip memory with different
capacities and communication characteristics. A key
challenge for achieving high-performance with reconfigurable
accelerators is the efficient utilization of the available memory
resources. A detailed knowledge of the memories' parameters
is key for generating an optimized communication layout. In this paper, we discuss a benchmarking environment
for generating such a characterization. The environment is
built on IMORC, our architectural template and on-chip
network for creating reconfigurable accelerators. We provide
a characterization of the memory resources available on the
XtremeData XD1000 reconfigurable computer. Based on this
data, we present as a case study the implementation of a 3D
image compositing accelerator that is able to double the frame rate
of a parallel renderer
- …